fix(tests): detect PG graceful switchover at PostgreSQL level #100
renecannao wants to merge 2 commits into master
Conversation
…a cluster view

The post-merge CI run surfaced that the inherited "primary did not change" check is unreliable for PG graceful takeover. Because the demoted primary keeps running until the operator restarts it with standby.signal, orchestrator sees two cluster roots afterward (one per host, each RO=false), and "find RO=false in original cluster" returns the same host both before and after — producing a spurious failure.

Replace with direct PostgreSQL-level verification:

- pgstandby1 has `pg_is_in_recovery()=false` (promoted)
- pgprimary has `default_transaction_read_only=on` (demoted)

Existing writability and `primary_conninfo` checks remain.

The round-trip section now gates on SWITCHOVER_OK from the direct check and discovers the new cluster name dynamically (pgstandby1 becomes a new cluster root after the first takeover, so the round-trip API call must target its cluster, not the original PG_CLUSTER). The round-trip completion check also uses `pg_is_in_recovery()` on pgprimary instead of the split-brain cluster view.
📝 Walkthrough

This change refactors a PostgreSQL graceful switchover test by replacing orchestrator API-based topology verification with direct PostgreSQL-level assertions. The test now validates instance promotion using `pg_is_in_recovery()` on pgstandby1 and demotion using `default_transaction_read_only` on pgprimary.
Code Review
This pull request refactors the PostgreSQL functional tests to verify switchover success through direct database queries instead of relying on Orchestrator's cluster view, which can be inconsistent during graceful takeovers. It also updates the topology re-unification logic to dynamically track cluster names. The review feedback suggests initializing variables to ensure compatibility with shell safety flags and optimizing performance by reducing redundant API calls within polling loops.
```shell
# Verify orchestrator sees pgstandby1 as primary and pgprimary as standby
TOPOLOGY_OK=false
NEW_CLUSTER=""
```
It is good practice to initialize all variables used in comparisons, especially since set -u is enabled at the beginning of the script. While PRIMARY_CLUSTER is assigned inside the loop, initializing it here alongside NEW_CLUSTER improves clarity and safety.
```diff
 NEW_CLUSTER=""
+PRIMARY_CLUSTER=""
```
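As a standalone illustration of the reviewer's point, a sketch of how `set -u` treats unset versus initialized-but-empty variables (`PRIMARY_CLUSTER` is the script's variable; `NEVER_ASSIGNED` is a hypothetical name used only here):

```shell
#!/bin/sh
set -u

PRIMARY_CLUSTER=""   # initialized: safe to test before the loop assigns it
if [ -z "$PRIMARY_CLUSTER" ]; then
  echo "empty but defined"
fi

# Expanding a variable that was never assigned aborts the (sub)shell
# under set -u instead of comparing as empty:
( echo "${NEVER_ASSIGNED}" ) 2>/dev/null || echo "unbound variable trapped"
```

Initializing up front makes later `[ "$PRIMARY_CLUSTER" = ... ]` comparisons safe on every code path, not just the one where the loop ran.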
```shell
NEW_CLUSTER=$(curl -s --max-time 10 "$ORC_URL/api/all-instances" 2>/dev/null | python3 -c "
import json, sys
for inst in json.load(sys.stdin):
    if not inst.get('ReadOnly', True):
        print(inst['Key']['Hostname'])
    if inst['Key']['Hostname'] == '172.30.0.21':
        print(inst.get('ClusterName', ''))
        sys.exit(0)
" 2>/dev/null || echo "")
if [ "$PRIMARY_HOST" = "172.30.0.21" ] || [ "$PRIMARY_HOST" = "pgstandby1" ]; then
    TOPOLOGY_OK=true
    # Verify pgprimary (172.30.0.20) joined the same cluster as pgstandby1
    PRIMARY_CLUSTER=$(curl -s --max-time 10 "$ORC_URL/api/all-instances" 2>/dev/null | python3 -c "
import json, sys
for inst in json.load(sys.stdin):
    if inst['Key']['Hostname'] == '172.30.0.20':
        print(inst.get('ClusterName', ''))
        sys.exit(0)
" 2>/dev/null || echo "")
```
The script calls the /api/all-instances endpoint twice within each iteration of the loop. Since this endpoint returns data for all instances, it is more efficient to fetch the JSON once and then parse it for both hostnames. This reduces the number of API calls and potential network overhead during the test.
```diff
-NEW_CLUSTER=$(curl -s --max-time 10 "$ORC_URL/api/all-instances" 2>/dev/null | python3 -c "
-import json, sys
-for inst in json.load(sys.stdin):
-    if not inst.get('ReadOnly', True):
-        print(inst['Key']['Hostname'])
-    if inst['Key']['Hostname'] == '172.30.0.21':
-        print(inst.get('ClusterName', ''))
-        sys.exit(0)
-" 2>/dev/null || echo "")
-if [ "$PRIMARY_HOST" = "172.30.0.21" ] || [ "$PRIMARY_HOST" = "pgstandby1" ]; then
-    TOPOLOGY_OK=true
-    # Verify pgprimary (172.30.0.20) joined the same cluster as pgstandby1
-    PRIMARY_CLUSTER=$(curl -s --max-time 10 "$ORC_URL/api/all-instances" 2>/dev/null | python3 -c "
-import json, sys
-for inst in json.load(sys.stdin):
-    if inst['Key']['Hostname'] == '172.30.0.20':
-        print(inst.get('ClusterName', ''))
-        sys.exit(0)
-" 2>/dev/null || echo "")
+ALL_INSTANCES=$(curl -s --max-time 10 "$ORC_URL/api/all-instances" 2>/dev/null)
+NEW_CLUSTER=$(echo "$ALL_INSTANCES" | python3 -c "
+import json, sys
+for inst in json.load(sys.stdin):
+    if inst['Key']['Hostname'] == '172.30.0.21':
+        print(inst.get('ClusterName', ''))
+        sys.exit(0)
+" 2>/dev/null || echo "")
+# Verify pgprimary (172.30.0.20) joined the same cluster as pgstandby1
+PRIMARY_CLUSTER=$(echo "$ALL_INSTANCES" | python3 -c "
+import json, sys
+for inst in json.load(sys.stdin):
+    if inst['Key']['Hostname'] == '172.30.0.20':
+        print(inst.get('ClusterName', ''))
+        sys.exit(0)
+" 2>/dev/null || echo "")
```
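The one-snapshot idea can be sketched offline; the JSON below is a hypothetical `/api/all-instances` payload using the same field names the test parses:

```shell
#!/bin/sh
# One snapshot, parsed once for both hosts.
SNAPSHOT='[
  {"Key": {"Hostname": "172.30.0.21"}, "ClusterName": "172.30.0.21:5432"},
  {"Key": {"Hostname": "172.30.0.20"}, "ClusterName": "172.30.0.21:5432"}
]'
CLUSTERS=$(printf '%s' "$SNAPSHOT" | python3 -c "
import json, sys
by_host = {i['Key']['Hostname']: i.get('ClusterName', '') for i in json.load(sys.stdin)}
print(by_host.get('172.30.0.21', ''))
print(by_host.get('172.30.0.20', ''))
")
NEW_CLUSTER=$(printf '%s\n' "$CLUSTERS" | sed -n 1p)
PRIMARY_CLUSTER=$(printf '%s\n' "$CLUSTERS" | sed -n 2p)
[ "$NEW_CLUSTER" = "$PRIMARY_CLUSTER" ] && echo "same cluster: $NEW_CLUSTER"
```

Deriving both values from one parse also removes the race the review describes, since the comparison can never mix two snapshots.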
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/functional/test-postgresql.sh (1)
289-316: ⚠️ Potential issue | 🟠 Major

Avoid mutating pgstandby1 after a failed round-trip.

Because this script intentionally continues after `fail` calls, a failed `BACK_CODE` or failed `BACK_PROMOTED` check still falls through to reactivating pgstandby1. That can corrupt the setup for the downstream dead-primary failover test and obscure the original failure.

Proposed fix
```diff
 if [ "$BACK_CODE" = "OK" ]; then
     pass "Round-trip graceful takeover API returned OK"
 else
     fail "Round-trip graceful takeover returned: $BACK_CODE — $BACK_RESULT"
 fi
-sleep 10
+BACK_OK=false
+if [ "$BACK_CODE" = "OK" ]; then
+    sleep 10

-# Verify pgprimary is now promoted (not in recovery)
-BACK_PROMOTED=$($COMPOSE exec -T pgprimary psql -U postgres -tAc "SELECT pg_is_in_recovery();" 2>/dev/null | tr -d '[:space:]')
-if [ "$BACK_PROMOTED" = "f" ]; then
-    pass "Round-trip complete: pgprimary is primary again"
-else
-    fail "Round-trip incomplete: pgprimary pg_is_in_recovery='$BACK_PROMOTED' (expected f)"
+    # Verify pgprimary is now promoted (not in recovery)
+    BACK_PROMOTED=$($COMPOSE exec -T pgprimary psql -U postgres -tAc "SELECT pg_is_in_recovery();" 2>/dev/null | tr -d '[:space:]')
+    if [ "$BACK_PROMOTED" = "f" ]; then
+        pass "Round-trip complete: pgprimary is primary again"
+        BACK_OK=true
+    else
+        fail "Round-trip incomplete: pgprimary pg_is_in_recovery='$BACK_PROMOTED' (expected f)"
+    fi
 fi

 # After round-trip, pgstandby1 is the demoted primary — reactivate
 # it as a live standby so the downstream failover-kill test has a
 # replica to promote.
-echo "Reactivating pgstandby1 as a live standby of pgprimary..."
-$COMPOSE exec -T pgstandby1 bash -c 'touch /var/lib/postgresql/data/standby.signal && chown postgres:postgres /var/lib/postgresql/data/standby.signal' || true
-$COMPOSE restart pgstandby1
+if [ "$BACK_OK" = "true" ]; then
+    echo "Reactivating pgstandby1 as a live standby of pgprimary..."
+    $COMPOSE exec -T pgstandby1 bash -c 'touch /var/lib/postgresql/data/standby.signal && chown postgres:postgres /var/lib/postgresql/data/standby.signal' || true
+    $COMPOSE restart pgstandby1
+else
+    skip "Skipping pgstandby1 reactivation — round-trip did not complete"
+fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional/test-postgresql.sh` around lines 289 - 316, The script currently always reactivates pgstandby1 even if the round-trip checks failed; wrap the reactivation block (the lines touching pgstandby1: the touch/chown and $COMPOSE restart pgstandby1) in a guard that only runs when both BACK_CODE = "OK" and BACK_PROMOTED = "f" (i.e. after the pass calls), or alternatively add an immediate exit/return after the fail calls; reference BACK_CODE, BACK_PROMOTED, the pass/fail calls and the pgstandby1 reactivation commands to locate and change the logic so no mutation happens when the round-trip failed.
🧹 Nitpick comments (1)
tests/functional/test-postgresql.sh (1)
255-287: Use one topology snapshot per polling iteration.

`NEW_CLUSTER` and `PRIMARY_CLUSTER` are read from separate `/api/all-instances` calls while rediscovery is actively changing state. Comparing values from different snapshots can make this poll flaky; fetch once and derive both values from the same JSON payload.

Proposed refactor
```diff
 NEW_CLUSTER=""
 for i in $(seq 1 30); do
-  NEW_CLUSTER=$(curl -s --max-time 10 "$ORC_URL/api/all-instances" 2>/dev/null | python3 -c "
+  ALL_INSTANCES=$(curl -s --max-time 10 "$ORC_URL/api/all-instances" 2>/dev/null || echo "[]")
+  NEW_CLUSTER=$(printf '%s' "$ALL_INSTANCES" | python3 -c "
 import json, sys
 for inst in json.load(sys.stdin):
     if inst['Key']['Hostname'] == '172.30.0.21':
         print(inst.get('ClusterName', ''))
         sys.exit(0)
 " 2>/dev/null || echo "")
   # Verify pgprimary (172.30.0.20) joined the same cluster as pgstandby1
-  PRIMARY_CLUSTER=$(curl -s --max-time 10 "$ORC_URL/api/all-instances" 2>/dev/null | python3 -c "
+  PRIMARY_CLUSTER=$(printf '%s' "$ALL_INSTANCES" | python3 -c "
 import json, sys
 for inst in json.load(sys.stdin):
     if inst['Key']['Hostname'] == '172.30.0.20':
         print(inst.get('ClusterName', ''))
         sys.exit(0)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional/test-postgresql.sh` around lines 255 - 287, The loop currently calls /api/all-instances twice and assigns to NEW_CLUSTER and PRIMARY_CLUSTER from separate snapshots, causing flakiness; change to fetch the topology once per iteration (e.g., store the curl output into a local variable like SNAPSHOT_JSON) and then extract both NEW_CLUSTER and PRIMARY_CLUSTER from that same JSON payload (use the existing python3 parsing blocks or jq) so both values come from the identical snapshot before comparing them and before any re-seed calls.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5bd71028-cc57-4822-bf9e-15dc1a716448
📒 Files selected for processing (1)
tests/functional/test-postgresql.sh
```diff
 # Verify the switchover at the PostgreSQL level, not via orchestrator's
 # cluster view. After a PG graceful takeover the demoted primary is still
 # running (awaiting an operator-managed restart with standby.signal), so
 # orchestrator sees two roots — one per former cluster — and a "find RO=false
 # in original cluster" check returns the same host both times.
 SWITCHOVER_OK=false

 # pgstandby1 must have been promoted (no longer in recovery)
 PROMOTED=$($COMPOSE exec -T pgstandby1 psql -U postgres -tAc "SELECT pg_is_in_recovery();" 2>/dev/null | tr -d '[:space:]')
 if [ "$PROMOTED" = "f" ]; then
     pass "pgstandby1 has been promoted (pg_is_in_recovery=false)"
     SWITCHOVER_OK=true
 else
     fail "pgstandby1 still in recovery after switchover (got: '$PROMOTED')"
 fi

-if [ -n "$NEW_PRIMARY" ] && [ "$NEW_PRIMARY" != "$CURRENT_PRIMARY" ]; then
-    pass "Primary switched from $CURRENT_PRIMARY to $NEW_PRIMARY"
+# pgprimary must have been set read-only (default_transaction_read_only=on)
+DEMOTED_RO=$($COMPOSE exec -T pgprimary psql -U postgres -tAc "SHOW default_transaction_read_only;" 2>/dev/null | tr -d '[:space:]')
+if [ "$DEMOTED_RO" = "on" ]; then
+    pass "pgprimary has default_transaction_read_only=on"
 else
-    fail "Primary did not change: was $CURRENT_PRIMARY, now ${NEW_PRIMARY:-unknown}"
+    fail "pgprimary default_transaction_read_only=$DEMOTED_RO (expected on)"
 fi
```
Gate SWITCHOVER_OK on both promotion and demotion.

Line 166 sets SWITCHOVER_OK=true after pgstandby1's promotion alone. If the demotion check at line 176 fails, line 227 still runs the round-trip against a partially completed switchover, contrary to the intended PostgreSQL-level gate.
Proposed fix

```diff
 # pgstandby1 must have been promoted (no longer in recovery)
+PROMOTED_OK=false
 PROMOTED=$($COMPOSE exec -T pgstandby1 psql -U postgres -tAc "SELECT pg_is_in_recovery();" 2>/dev/null | tr -d '[:space:]')
 if [ "$PROMOTED" = "f" ]; then
     pass "pgstandby1 has been promoted (pg_is_in_recovery=false)"
-    SWITCHOVER_OK=true
+    PROMOTED_OK=true
 else
     fail "pgstandby1 still in recovery after switchover (got: '$PROMOTED')"
 fi
 # pgprimary must have been set read-only (default_transaction_read_only=on)
+DEMOTED_OK=false
 DEMOTED_RO=$($COMPOSE exec -T pgprimary psql -U postgres -tAc "SHOW default_transaction_read_only;" 2>/dev/null | tr -d '[:space:]')
 if [ "$DEMOTED_RO" = "on" ]; then
     pass "pgprimary has default_transaction_read_only=on"
+    DEMOTED_OK=true
 else
     fail "pgprimary default_transaction_read_only=$DEMOTED_RO (expected on)"
 fi
+
+if [ "$PROMOTED_OK" = "true" ] && [ "$DEMOTED_OK" = "true" ]; then
+    SWITCHOVER_OK=true
+fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional/test-postgresql.sh` around lines 155 - 177, The gate
SWITCHOVER_OK is set true after checking PROMOTED only, allowing later steps to
proceed even if DEMOTED_RO check fails; change the logic so SWITCHOVER_OK is
only set to true when both conditions succeed: check PROMOTED
(pg_is_in_recovery=false) and then check DEMOTED_RO (SHOW
default_transaction_read_only = "on"), and only set SWITCHOVER_OK=true after the
DEMOTED_RO branch confirms success (or compute SWITCHOVER_OK as the conjunction
of the two checks using PROMOTED and DEMOTED_RO variables); update references to
SWITCHOVER_OK accordingly so downstream round-trip runs only when both promotion
and demotion succeeded.
The round-trip round of the PG functional test hung for 60s on the second graceful-master-takeover-auto: the wait-for-LSN step never completed because the demoted primary's WAL receiver could not authenticate to the new primary.

Root cause: PostgreSQLReconfigureStandby / PostgreSQLRepositionAsStandby set primary_conninfo using orchestrator's credentials (PostgreSQLTopologyUser / PostgreSQLTopologyPassword), but the test cluster's pg_hba.conf only had "host replication repl" — no entry for the orchestrator user. Streaming replication was silently rejected.

Add "host replication orchestrator all md5" so that once the operator restarts the demoted primary with standby.signal, the primary_conninfo orchestrator set actually authenticates. This matches the operational prerequisite a real deployment would need to satisfy for the PostGracefulTakeoverProcesses hook to work with only a restart.
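A sketch of that prerequisite (the data directory path is illustrative; on a live server the rule only takes effect after a config reload):

```shell
#!/bin/sh
PGDATA="${PGDATA:-/tmp/pgdata-demo}"   # assumed location for this sketch
mkdir -p "$PGDATA"
touch "$PGDATA/pg_hba.conf"

# Allow the credentials orchestrator writes into primary_conninfo
# (PostgreSQLTopologyUser) to stream WAL:
cat >> "$PGDATA/pg_hba.conf" <<'EOF'
host replication orchestrator all md5
EOF
echo "hba entry appended"

# On a running server, follow with:
#   psql -U postgres -c "SELECT pg_reload_conf();"
```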
Follow-up commit
Summary
PR #99's functional tests fail on master across PG 15/16/17. The failing assertion (`Primary did not change: was 172.30.0.20:5432, now 172.30.0.20:5432`) is an inherited check that does not work for PG graceful takeover: the demoted primary keeps running until the operator restarts it with `standby.signal`, so both hosts report RO=false and the original-cluster lookup returns the same host both times; only the `primary_conninfo` on the demoted primary reflects the new topology. Per the design, the operator completes the switchover via a `PostGracefulTakeoverProcesses` hook that creates `standby.signal` and restarts.

Run: https://github.com/ProxySQL/orchestrator/actions/runs/24606553086
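The ambiguity is easy to reproduce offline; the payload below is hypothetical but uses the `ReadOnly` field the check inspects:

```shell
#!/bin/sh
# After a PG graceful takeover, both hosts still report ReadOnly=false,
# so "find the RO=false instance" no longer identifies a single primary.
ROOTS=$(python3 -c "
import json, sys
insts = json.load(sys.stdin)
print(len([i for i in insts if not i.get('ReadOnly', True)]))
" <<'EOF'
[{"Key": {"Hostname": "172.30.0.20"}, "ReadOnly": false},
 {"Key": {"Hostname": "172.30.0.21"}, "ReadOnly": false}]
EOF
)
echo "cluster roots seen: $ROOTS"
```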
Fix

Replace the cluster-view check with direct PostgreSQL-level signals:

- `pg_is_in_recovery()=f` on pgstandby1 (the standby was promoted)
- `default_transaction_read_only=on` on pgprimary (the primary was demoted)

The existing writability check against pgstandby1 and the `primary_conninfo` check on pgprimary stay.

The round-trip section now:

- Gates on `SWITCHOVER_OK` derived from the direct PG check.
- Discovers the new cluster name dynamically (pgstandby1 becomes a new cluster root after the first takeover, so the round-trip API call must target its cluster, not the original `PG_CLUSTER`).
- Uses `pg_is_in_recovery()` on pgprimary to verify the round-trip instead of the split-brain cluster view.

Test plan
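The gating can be exercised without live containers; a stubbed sketch (query results hard-coded, the script's `pass`/`fail` helpers omitted):

```shell
#!/bin/sh
# Stubbed query results; the real test gets these via psql in the containers.
PROMOTED="f"       # pgstandby1: SELECT pg_is_in_recovery();
DEMOTED_RO="on"    # pgprimary:  SHOW default_transaction_read_only;

SWITCHOVER_OK=false
if [ "$PROMOTED" = "f" ] && [ "$DEMOTED_RO" = "on" ]; then
  SWITCHOVER_OK=true
fi
echo "SWITCHOVER_OK=$SWITCHOVER_OK"

# The round-trip takeover runs only behind this gate.
if [ "$SWITCHOVER_OK" = "true" ]; then
  echo "round-trip takeover would run"
fi
```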